Analyzing Features for the Detection of Happy Endings in German Novels

Authors

  • Fotis Jannidis
  • Isabella Reger
  • Albin Zehe
  • Martin Becker
  • Lena Hettinger
  • Andreas Hotho
Abstract

With regard to a computational representation of literary plot, this paper looks at the use of sentiment analysis for happy ending detection in German novels. Its focus lies on the investigation of previously proposed sentiment features in order to gain insight into the relevance of specific features on the one hand and the implications of their performance on the other hand. To this end, we study various partitionings of novels, considering the highly variable concept of "ending". We also show that our approach, even though still rather simple, can potentially lead to substantial findings relevant to literary studies.

Introduction

Plot is fundamental for the structure of literary works. Methods for the computational representation of plot or special plot elements would therefore be a great achievement for digital literary studies. This paper looks at one such element: happy endings. We employ sentiment analysis for the detection of happy endings, but focus on a qualitative analysis of specific features and their performance in order to gain deeper insight into the automatic classification. In addition, we show how the applied method can be used for subsequent research questions, yielding interesting results with regard to publishing periods of the novels.

Related Work

One of the first works dealt with folkloristic tales: Mark Finlayson created an algorithm capable of detecting events and higher-level abstractions, such as villainy or reward (Finlayson 2012). Reiter et al., again working on tales, identify events, their participants and their order, and use machine learning methods to find structural similarities across texts (Reiter 2013, Reiter et al. 2014). More recently, sentiment analysis has attracted considerable attention after Matthew Jockers proposed emotional arousal as a new “method for detecting plot” (Jockers 2014). He described his idea to split novels into segments and use those to form plot trajectories (Jockers 2015).
Despite general acceptance of the idea to employ sentiment analysis, his use of the Fourier Transformation to smooth the resulting plot curves was criticized (Swafford 2015, Schmidt 2015). Micha Elsner (Elsner 2015) builds plot representations of romantic novels using, among other features, sentiment trajectories. He also links such trajectories with specific characters and looks at character co-occurrences. To evaluate his approach, he distinguishes real novels from artificially reordered surrogates with considerable success, showing that his methods indeed capture certain aspects of plot structure. In previous work, we used sentiment features to detect happy endings as a major plot element in German novels, reaching an F1-score of 73% (Zehe et al. 2016).

Corpus and Resources

Our dataset consists of 212 German-language novels, mostly from the 19th century [1]. Each novel has been manually annotated as either having a happy ending (50%) or not (50%). The relevant information has been obtained from summaries of the Kindler Literary Lexikon Online [2] and Wikipedia. If no summary was available, the corresponding parts of the novel have been read by the annotators.

Sentiment analysis requires a resource which lists sentiment values that human readers typically associate with certain words or phrases in a text. This paper relies on the NRC Sentiment Lexicon (Mohammad and Turney 2013), which is available in an automatically translated German version [3]. A notable feature of this lexicon is that, besides specifying binary values (0 or 1) for negative and positive connotations (2 features), it also categorizes words into 8 basic emotions (anger, fear, disgust, surprise, joy, anticipation, trust and sadness); see Table 1 for an example. We add another value (the polarity) by subtracting the negative from the positive value (e.g. a word with a positive value of 0 and a negative value of 1 has a polarity value of -1).
The polarity serves as an overall sentiment score, which brings the total to 11 features per word (2 connotations, 8 emotions and the polarity).

Table 1: Example entries from the NRC Sentiment Lexicon

Dimension      verabscheuen (to detest)   bewundernswert (admirable)   Zufall (coincidence)
Positive        0                          1                            0
Negative        1                          0                            0
Polarity       -1                          1                            0
Anger           1                          0                            0
Anticipation    0                          0                            0
Disgust         1                          0                            0
Fear            1                          0                            0
Joy             0                          1                            0
Sadness         0                          0                            0
Surprise        0                          0                            1
Trust           0                          1                            0

[1] Source: https://textgrid.de/digitale-bibliothek
[2] www.kll-online.de
[3] http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm

Experiments

The goal of this paper is to investigate features that have been used for the detection of happy endings in novels in order to gain insight into the relevance of specific feature sets on the one hand and the implications of their performance on the other hand. To that end, we adopt the features and methods presented in Zehe et al. (2016). The parameters of the linear SVM and the partitioning into 75 segments are also adopted from that paper.

Features. Since reliable chapter annotations were not available, each novel has been split into 75 equally sized blocks, called segments. For each lemmatized word, we look up the 11 sentiment values (including polarity, see above). Then, for each segment, we calculate the respective averages, resulting in 11 scores per segment. We group those 11 scores into one feature set.

Qualitative Feature Analysis. As our corpus consists of an equal number of novels with and without a happy ending, the random baseline as well as the majority vote baseline amount to 50% classification accuracy. Since we assumed that the relevant information for identifying happy endings can be found at the end of a novel, we first used the sentiment scores of the final segment (f_{d,n}) as the only feature set, reaching an F1-score of 67%.
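The segment-level scoring described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon only holds the three Table 1 entries, the function names are hypothetical, and skipping words that are missing from the lexicon when averaging is an assumption, since the text does not say how such words are handled.

```python
from statistics import mean

# Positive/negative connotation plus the eight basic emotions named in the text.
DIMENSIONS = ["positive", "negative", "anger", "anticipation", "disgust",
              "fear", "joy", "sadness", "surprise", "trust"]
FEATURES = DIMENSIONS + ["polarity"]  # 11 features in total

# Toy lexicon with the three entries from Table 1 (hypothetical data layout;
# dimensions not listed for a word are 0).
LEXICON = {
    "verabscheuen": {"negative": 1, "anger": 1, "disgust": 1, "fear": 1},
    "bewundernswert": {"positive": 1, "joy": 1, "trust": 1},
    "Zufall": {"surprise": 1},
}


def word_scores(lemma):
    """Return the 11 sentiment values of a lemma, or None if it is unknown."""
    entry = LEXICON.get(lemma)
    if entry is None:
        return None
    scores = {dim: entry.get(dim, 0) for dim in DIMENSIONS}
    # Polarity = positive value minus negative value.
    scores["polarity"] = scores["positive"] - scores["negative"]
    return scores


def split_into_segments(lemmas, n_segments=75):
    """Split a list of lemmas into n_segments contiguous, near-equal blocks."""
    size, extra = divmod(len(lemmas), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        end = start + size + (1 if i < extra else 0)
        segments.append(lemmas[start:end])
        start = end
    return segments


def segment_scores(segment):
    """Average each of the 11 values over the lexicon words of one segment.

    Assumption: lemmas that are not in the lexicon are ignored.
    """
    found = [s for s in map(word_scores, segment) if s is not None]
    if not found:
        return {feature: 0.0 for feature in FEATURES}
    return {feature: mean(s[feature] for s in found) for feature in FEATURES}


# Usage with a tiny "novel" split into 2 segments instead of 75.
lemmas = ["Zufall", "verabscheuen", "bewundernswert", "bewundernswert"]
segments = split_into_segments(lemmas, n_segments=2)
first_features = segment_scores(segments[0])
final_features = segment_scores(segments[-1])  # scores of the final segment
```

In this toy input the first segment mixes a neutral and a negative word, while the final segment contains only a positive word, so the final segment's polarity is higher than that of the first.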
Following the intuition that not only the last segment by itself, but also its relation to the rest of the novel is meaningful for the classification, we introduced the notion of sections: the last segment of a novel constitutes the final section, whereas the remaining segments belong to the main section. Averages were also calculated for the sections by taking the mean of each feature over all segments in the section. To further emphasize the relation between these sections, we added the differences between the sentiment scores of the final section and the average sentiment scores over all segments in the main section. However, this change did not influence the results. This led us to believe that our notion of an “ending” was not accurate enough, as the number of segments for each novel and therefore the boundaries of the final segment have been chosen rather arbitrarily. To approach this issue, we varied the partitioning into main and final section so that the final section can contain more than just the last segment.

Figure 1: Classification F1-score for different partitionings into main and final section. The dashed line represents a random baseline, the dotted line shows where the maximum F1-score is reached.

Figure 1 shows that classification accuracy improves when at least 75% of the segments are in the main section and reaches a peak at about 95% (this means 4 segments in the final section and 71 segments in the main section, for a total of 75 segments). With this partitioning strategy, we improve the F1-score to 68% using only the feature set for the final section (f_{d,final}) and reach an F1-score of 69% when also including the differences to the average sentiment scores of the main section (f_{d,main-final}).

Since adding the relation between the main section and the final section improved our results in the previous setting, we tried to model the development of the sentiments towards the end of the novel more thoroughly.
For example, a catastrophic event might happen shortly before the end of a novel and finally be resolved in a happy ending. To capture this intuition, we introduced one more section, namely the late-main section, which focuses on the segments right before the final section, and used the difference between the feature sets for the late-main and the final section as an additional feature set (f_{d,late-final}). Using those three feature sets (f_{d,final}, f_{d,main-final} and f_{d,late-final}), the classification of happy endings reaches an F1-score of 70% and increases to 73% when including the feature set for the final segment (f_{d,n}).

Table 2: Classification F1-score for the different feature sets
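The section-based feature sets discussed in this analysis can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the size of the late-main section (`n_late`) is not given in the text and is chosen arbitrarily, the main section is assumed to contain all segments before the final section (including the late-main ones), and the sign of the differences follows the order of the subscripts (main minus final, late minus final).

```python
def section_mean(scores, start, end):
    """Average every feature over the segments scores[start:end]."""
    block = scores[start:end]
    return [sum(column) / len(block) for column in zip(*block)]


def difference(a, b):
    """Element-wise a - b for two feature vectors of equal length."""
    return [x - y for x, y in zip(a, b)]


def ending_features(scores, n_final=4, n_late=4):
    """Build the feature sets for one novel.

    scores  -- one feature vector per segment (11 values each in the paper)
    n_final -- segments in the final section (4 of 75 at the reported peak)
    n_late  -- segments in the late-main section (hypothetical value)
    """
    n = len(scores)
    if n_final < 1 or n_late < 1 or n_final + n_late > n:
        raise ValueError("sections do not fit into the novel")
    final = section_mean(scores, n - n_final, n)
    main = section_mean(scores, 0, n - n_final)  # all segments before the final section
    late = section_mean(scores, n - n_final - n_late, n - n_final)
    return {
        "final": final,                           # f_{d,final}
        "main-final": difference(main, final),    # f_{d,main-final}
        "late-final": difference(late, final),    # f_{d,late-final}
        "last_segment": list(scores[-1]),         # f_{d,n}
    }


# Usage: four segments with two features each, one segment per late/final section.
toy_scores = [[0.0, 1.0], [2.0, 1.0], [4.0, 4.0], [6.0, 5.0]]
toy_features = ending_features(toy_scores, n_final=1, n_late=1)
```

Concatenating the four vectors would give one input vector per novel for a classifier such as the linear SVM mentioned in the text. Varying `n_final` corresponds to the different partitionings into main and final section.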

Similar resources

Prediction of Happy Endings in German Novels based on Sentiment Information

Identifying plot structure in novels is a valuable step towards automatic processing of literary corpora. We present an approach to classify novels as either having a happy ending or not. To achieve this, we use features based on different sentiment lexica as input for an SVM classifier, which yields an average F1-score of about 73%.

Prediction of Happy Endings in German Novels

Identifying plot structure in novels is a valuable step towards automatic processing of literary corpora. We present an approach to classify novels as either having a happy ending or not. To achieve this, we use features based on different sentiment lexica as input for an SVM classifier, which yields an average F1-score of about 73%.

A Study of the Content Features and Characterization of Best-Selling Young Adult Novels Published between 1380 and 1389

Purpose: To review the content and the characters of the parsonage in Persian youth best–selling novels between the 2000-2010s. Methodology: Quantitative content analysis was carried out on 14 novels for the youth reprinted more than ten times. Titles were chosen from the list prepared by the Book House. Findings: Most characters faced similar problems, such as poverty, physical illness or me...

MEFUASN: A Helpful Method to Extract Features using Analyzing Social Network for Fraud Detection

Fraud detection is one of the ways to cope with damages associated with fraudulent activities that have become common due to the rapid development of the Internet and electronic business. There is a need to propose methods to detect fraud accurately and fast. To achieve to accuracy, fraud detection methods need to consider both kind of features, features based on user level and features based o...

Journal title:
  • CoRR

Volume abs/1611.09028

Publication date 2016